class: center, middle, inverse, title-slide

# Lecture 21

## Multiple Linear Regression

### Psych 10 C

### University of California, Irvine

### 05/18/2022

---

## Linear regression

- Last class we finished our example with multiple linear regression in a mental rotation task.

--

- Today we will look at another example; however, this time we will not have a hypothesis, so we will have to take a "brute force" approach.

--

- We are interested in studying the effects of age and height on blood pressure.

--

- We are not sure whether only one or both of these variables are good predictors, so we want to compare all the models we can build using these two variables.

---

## Data

- Now that we have a research question (are height and age good predictors of blood pressure?), we need to look at the data in the study.

--

- There are 50 participants in this study, all of whom had their blood pressure taken during a routine check-up. The average blood pressure of participants was 116.42 mmHg, with a range of 92 to 144.

--

- The age of the participants ranged from 20 to 70 years, with an average of 44.64 years.

--

- The height of participants ranged from 58.3 to 75.8 inches, with an average of 66.894 inches.

--

- Of the participants in the study, 25 are female and the rest are male.

---

## Data

- Now that we have a description of the data, we can visualize our observations using scatter plots. In this case we are interested in two variables (age and height), so we can make two separate graphs.

--

.pull-left[
<img src="lec-21_files/figure-html/blood-age-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="lec-21_files/figure-html/blood-height-1.png" style="display: block; margin: auto;" />
]

---

## Data

- From the previous graphs it seems that both height and age could be associated with the blood pressure of participants.

--

- However, we can't draw conclusions from a plot; we need to fit linear models in order to tell whether our independent variables are good predictors.

--

- Given that we have 2 continuous independent variables (setting aside the sex of participants, which is categorical), we can compare 4 models.

--

- When we **only** have continuous variables as predictors we will not model interactions, because they are hard to interpret in this setting.

---

## Models

- The 4 models that we need to compare are:

--

1. **Null model**: Blood pressure is constant regardless of the age and height of a participant

`$$\text{blood-pressure}_i \sim \text{Normal}(\beta_0, \sigma^2_1)$$`

--

1. **Age model**: Blood pressure is a linear function of the age of the participant

`$$\text{blood-pressure}_i \sim \text{Normal}(\beta_0 + \beta_1 \text{age}_i, \sigma^2_2)$$`

--

1. **Height model**: Blood pressure is a linear function of the height of the participant

`$$\text{blood-pressure}_i \sim \text{Normal}(\beta_0 + \beta_2 \text{height}_i, \sigma^2_3)$$`

--

1. **Age & Height model**: Blood pressure is a linear function of the age and height of the participant

`$$\text{blood-pressure}_i \sim \text{Normal}(\beta_0 + \beta_1 \text{age}_i + \beta_2 \text{height}_i, \sigma^2_4)$$`

---

## Predictions and errors

- As we have done before, we want to calculate the predictions and errors of each model in order to compare them and select the best one.

--

- Once we have the model that accounts for the data best, we will look at the distribution of the differences between the observations and the model's predictions `\((\hat{\epsilon}_i)\)` to evaluate the adequacy of the model.
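--

- As a preview, all four models can be fit in R with `lm()`. This is a minimal sketch, assuming the data frame is called `pressure` with columns `blood_pressure`, `age`, and `height` (the names used on the following slides):


```r
# The four candidate models expressed as lm() formulas
fit_null   <- lm(blood_pressure ~ 1,            data = pressure)  # intercept only
fit_age    <- lm(blood_pressure ~ age,          data = pressure)
fit_height <- lm(blood_pressure ~ height,       data = pressure)
fit_ah     <- lm(blood_pressure ~ age + height, data = pressure)
```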
---

## Null model

.pull-left[

```r
# Total sample size
n_total <- nrow(pressure)

# Prediction of the null model
null <- pressure %>%
  summarise("pred" = mean(blood_pressure)) %>%
  pull(pred)

# Adding prediction and squared error of the null model to the data
pressure <- pressure %>%
  mutate("prediction_null" = null,
         "error_null" = (blood_pressure - prediction_null)^2)

# Calculating the SSE of the null model
sse_null <- sum(pressure$error_null)

# Calculating the MSE of the null model
mse_null <- 1/n_total * sse_null

# Calculating the BIC of the null model
bic_null <- n_total * log(mse_null) + 1 * log(n_total)
```
]

.pull-right[
<br> <br> <br>
<img src="lec-21_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />
]

- The estimate of the intercept was `\(\hat{\beta}_0\)` = 116.42.

---

## Age model

.pull-left[

```r
# Get estimates of beta0 and beta1
betas_age <- lm(formula = blood_pressure ~ age,
                data = pressure)$coef

# Adding prediction and squared error of the age lm to the data
pressure <- pressure %>%
  mutate("prediction_age" = betas_age[1] + betas_age[2] * age,
         "error_age" = (blood_pressure - prediction_age)^2)

# Calculating the SSE of the age lm
sse_age <- sum(pressure$error_age)

# Calculating the MSE of the age lm
mse_age <- 1/n_total * sse_age

# Calculating R^2 for the age lm
r2_age <- (sse_null - sse_age) / sse_null

# Calculating the BIC of the age lm
bic_age <- n_total * log(mse_age) + 2 * log(n_total)
```
]

.pull-right[
<br> <br> <br>
<img src="lec-21_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

- The estimate of the intercept was `\(\hat{\beta}_0\)` = 91.7, and the estimate of the slope associated with age was `\(\hat{\beta}_1\)` = 0.55.

---

## Height model

.pull-left[

```r
# Get estimates of beta0 and beta2
betas_height <- lm(formula = blood_pressure ~ height,
                   data = pressure)$coef

# Adding prediction and squared error of the height lm to the data
pressure <- pressure %>%
  mutate("prediction_height" = betas_height[1] +
           betas_height[2] * height,
         "error_height" = (blood_pressure - prediction_height)^2)

# Calculating the SSE of the height lm
sse_height <- sum(pressure$error_height)

# Calculating the MSE of the height lm
mse_height <- 1/n_total * sse_height

# Calculating R^2 for the height lm
r2_height <- (sse_null - sse_height) / sse_null

# Calculating the BIC of the height lm
bic_height <- n_total * log(mse_height) + 2 * log(n_total)
```
]

.pull-right[
<br> <br> <br>
<img src="lec-21_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

- The estimate of the intercept was `\(\hat{\beta}_0\)` = 22.21, and the estimate of the slope associated with height was `\(\hat{\beta}_2\)` = 1.41.
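--

- Each code block above computes the BIC with the same pattern; writing it out once makes the pattern explicit:

`$$\text{BIC} = n \log(\text{MSE}) + k \log(n)$$`

- Here `\(n\)` is the number of observations, and `\(k\)` counts the `\(\beta\)` parameters: 1 for the null model and 2 for the age or height models. Smaller BIC values indicate a better trade-off between fit and complexity.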
---

## Multiple linear regression: Age + Height

- The last of the four models is the multiple linear regression that includes both age and height as predictors of blood pressure:

.pull-left[

```r
# Get estimates of beta0, beta1, and beta2
betas_ah <- lm(formula = blood_pressure ~ age + height,
               data = pressure)$coef

# Adding prediction and squared error of the age + height lm
# to the data
pressure <- pressure %>%
  mutate("prediction_ah" = betas_ah[1] +
           betas_ah[2] * age +
           betas_ah[3] * height,
         "error_ah" = (blood_pressure - prediction_ah)^2)

# Calculating the SSE of the age + height lm
sse_ah <- sum(pressure$error_ah)

# Calculating the MSE of the age + height lm
mse_ah <- 1/n_total * sse_ah

# Calculating R^2 for the age + height lm
r2_ah <- (sse_null - sse_ah) / sse_null

# Calculating the BIC of the age + height lm
bic_ah <- n_total * log(mse_ah) + 3 * log(n_total)
```
]

.pull-right[
<br> <br> <br>

| Estimate | Value |
|----------|:-----:|
| `\(\hat{\beta}_0\)` | -14.26 |
| `\(\hat{\beta}_1\)` | 0.59 |
| `\(\hat{\beta}_2\)` | 1.56 |

]

---

## Model comparison

- Now we can compare our models using a table:

| Model | Parameters | MSE | `\(R^2\)` | BIC |
|-------|:----------:|:---:|:-----:|:---:|
| Null | 1 | 157.76 | | 256.97 |
| Age | 2 | 93.46 | 0.41 | 234.7 |
| Height | 2 | 120.28 | 0.24 | 247.31 |
| Age + Height | 3 | 47.75 | 0.7 | 205.04 |

--

- Of the four models that we compared using only the continuous variables in the study, the one that accounted for our observations best (lowest BIC) was the multiple linear regression.

--

- This suggests that both age and height are good predictors of a participant's blood pressure.

--

- However, there is a categorical variable that we have not taken into account which could play a role.

---

## Blood pressure by sex

- The other variable that we have left is the sex at birth of our participants.

--

- To see if this variable plays a role, we can look at the plots of blood pressure as a function of age and height, but this time we can use color to distinguish between the categories of our third independent variable.

--

- We will start by using height as the continuous variable.

--

.pull-left[
<img src="lec-21_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="lec-21_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]
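--

- Plots like these can be made with `ggplot2` by mapping the categorical variable to color. A minimal sketch, assuming the sex-at-birth column in `pressure` is called `sex` with labels "female" and "male" (a hypothetical name; the actual column may differ):


```r
library(ggplot2)

# Blood pressure against height, colored by sex at birth
ggplot(pressure, aes(x = height, y = blood_pressure, color = sex)) +
  geom_point() +
  labs(x = "Height (inches)", y = "Blood pressure (mmHg)")
```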
---

## Adding categorical variables

- From the two previous graphs we can see that whether a participant is male or female at birth has an impact on their blood pressure levels.

--

- Another way to think about this effect is in terms of populations.

--

- There is one population (females) that seems to have lower blood pressure levels on average in comparison to the second population (males).

--

- The relation between height, blood pressure, and sex at the time of birth is not that clear.

--

- However, the relation between age, blood pressure, and sex at the time of birth seems straightforward. On average, females seem to have lower blood pressure levels across all ages in comparison to males.

--

- Additionally, the effect of age seems to be the same for both groups. In other words, the average change in blood pressure associated with age seems to be the same in both groups.

---

## Adding categorical variables

- Now the problem is: how can we add this categorical variable to our linear models?

--

- As we mentioned when we worked with factorial designs, there are two ways to include a categorical variable in these models.

--

1. Additive: we assume that the expected value of our dependent variable is different for each group, but the effect of the other variables in the model is not affected.

--

1. Interactions: the level of the categorical variable changes the effect of a continuous independent variable in the model.

--

- How does this translate into our linear regression models?

---

## Additive effect of a categorical variable

- When we say that a categorical variable has an additive effect, what we mean is that the predicted values will differ depending on the level of the category.

--

- This means that the effect of the other variables in the model does not change.

--

- In order to introduce a categorical variable to the model, the first thing that we need to do is assign it numerical values.

--

- In other words, we need a function that takes the "labels" of the category and assigns a number to those labels.

--

- In our example we have sex at birth as a categorical variable that can take the labels "female" and "male", so we can create a new variable that assigns the value `\(0\)` to males and the value `\(1\)` to females.

---

## Adding categorical variables

- We need some notation to introduce these variables. Let `\(z_i\)` be the value assigned to the *i-th* observation for our new variable; `\(z_i\)` is defined as:

`$$z_i = \begin{cases} 0 & \quad \text{if observation } i \text{ is male}\\ 1 & \quad \text{if observation } i \text{ is female} \end{cases}$$`

--

- We have used this type of variable before: `\(z_i\)` is what we call an indicator function, in this case an indicator that the *i-th* observation takes the label "female".

--

- Now we can use our new variable `\(z_i\)` to formalize new models that include both categorical and continuous variables as predictors.

--

- For example, a model that assumes that only sex at birth is a good predictor of blood pressure levels can be formalized as:

`$$y_i \sim \text{Normal}(\beta_0 + \beta_3 z_i, \sigma_5^2)$$`

--

`$$\text{blood pressure}_i \sim \text{Normal}(\beta_0 + \beta_3 \text{sex at birth}_i, \sigma_5^2)$$`

---

## Adding categorical variables

- The interpretation of our parameters will change when we introduce a categorical variable into the model.

--

- In the model that we just described, the prediction of the model for a given observation is:

`$$\beta_0 + \beta_3 z_i$$`

--

- We know that if the *i-th* observation belongs to the "female" group then `\(z_i\)` will take the value of 1, which means that the expected blood pressure of female participants is equal to

`$$\beta_0 + \beta_3$$`

--

- However, if the observation belongs to the "male" group then our new variable `\(z_i\)` takes the value of 0, and thus the prediction of the model is equal to

`$$\beta_0$$`

---

## Interpretation of the parameters

- In other words, for the model that assumes that **only** the sex at birth of our participants is a predictor of blood pressure, and when we assign the value `\(1\)` to female participants and `\(0\)` to males, we can interpret our parameters as:

--

- `\(\beta_0\)`: the expected blood pressure of male participants.

--

- `\(\beta_3\)`: the change in expected blood pressure for female participants when compared to male participants.

--

- In this case, the population assigned to the value `\(0\)` is known as the baseline group.

--

- We then interpret the parameter associated with the value of our independent variable as the difference between the second group and the baseline.
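--

- A minimal sketch of this in R, again assuming the sex-at-birth column in `pressure` is called `sex` with labels "female" and "male" (a hypothetical name):


```r
library(dplyr)

# Build the indicator: 1 = female, 0 = male (the baseline group)
pressure <- pressure %>%
  mutate(z = if_else(sex == "female", 1, 0))

# Fit the sex-only model: blood_pressure ~ beta0 + beta3 * z
betas_sex <- lm(blood_pressure ~ z, data = pressure)$coef
betas_sex[1]  # beta0-hat: expected blood pressure for males
betas_sex[2]  # beta3-hat: difference for females relative to males
```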
---

## Comparisons between two groups and linear regression

- If the interpretation of the parameters in the linear regression with a single categorical variable sounds somewhat familiar, that is because it is another approach to the comparison between two groups.

--

- In week two we talked about how to compare the responses of two groups, and said that we could compare two models: one that assumed that the average of all participants was the same, which we denoted with `\(\mu\)`, and a second model which assumed that the expected value of the response was different between groups, which we denoted as `\(\mu_1\)` and `\(\mu_2\)`.

--

- This new linear regression model is a solution to the same problem!

--

- However, the parameters of this model have a different interpretation; in this case, `\(\mu_1 = \beta_0\)` and `\(\mu_2 = \beta_0 + \beta_3\)`.

--

- The linear regression model plays a very important role in statistics due to its flexibility; this is just one example.

--

- Next class we will show that we can have both continuous and categorical variables in a single linear model, and how to interpret our results with such models.
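--

- Before we get there, a quick numerical check of this equivalence (a sketch, assuming the 0/1 indicator `z` from the previous slides has been added to `pressure`):


```r
# The regression on the indicator recovers the two group means
fit <- lm(blood_pressure ~ z, data = pressure)

coef(fit)[1]    # beta0-hat: mu1-hat, the mean of the baseline group (z = 0)
sum(coef(fit))  # beta0-hat + beta3-hat: mu2-hat, the mean of the other group

# These match the group means computed directly
mean(pressure$blood_pressure[pressure$z == 0])
mean(pressure$blood_pressure[pressure$z == 1])
```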